File aegypti_albopictus.csv shows information about the location and detection time of two types of mosquitoes. Both Aedes aegypti and Aedes albopictus mosquitoes may spread viruses like Zika, dengue, chikungunya and other viruses but Aedes aegypti are more likely to spread these viruses (and therefore are more dangerous).
Use MapBox interface in Plotly to create two dot maps (for years 2004 and 2013) that show the distribution of the two types of mosquitos in the world (use color to distinguish between mosquitos). Analyze which countries and which regions in these countries had high density of each mosquito type and how the situation changed between these time points. What perception problems can be found in these plots?
As the figure below shows, in 2004 a large number of Aedes albopictus detections were recorded in Taiwan and a smaller cluster in the United States. The remaining detections were scattered across Mexico, South America, India and Southeast Asia, with a few in Europe. In the same year, numerous Aedes aegypti detections were also recorded in Taiwan, with the rest scattered across the United States, Mexico, South America, Africa and Southeast Asia.
In 2013, Taiwan still had a large number of Aedes albopictus detections and Europe had a small number, but almost none were recorded in other countries or regions. In the same year, Aedes aegypti detections remained substantial in Taiwan and decreased in Southeast Asia, while a very large number appeared in South America.
The main perception problem in the plots is overplotting: when many data points are packed into an area that spans several countries, the points overlap each other and cover the base map, so it is hard to tell the countries and the individual detection locations apart (as happens in South America in 2013).
Compute Z as the numbers of mosquitos per country detected during all study period. Use plot_geo() function to create a choropleth map that shows Z values. This map should have an Equirectangular projection. Why do you think there is so little information in the map?
The number of mosquitoes per country is shown in the choropleth map below.
The map conveys so little information because all detections in a country are reduced to a single colour, and at a quick glance it is hard to distinguish countries with similar numbers of detections by colour alone. The easiest remedy is to add details such as the Z value to the hover tooltip.
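The computation of Z and the map described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the report: the data frame is a made-up stand-in for aegypti_albopictus.csv, and the column names COUNTRY and COUNTRY_ID (ISO-3 country codes) are assumptions about the file.

```r
library(dplyr)
library(plotly)

# Toy stand-in for aegypti_albopictus.csv (one row per detection).
# COUNTRY and COUNTRY_ID are assumed column names, not confirmed by the text.
mosquito <- data.frame(
  COUNTRY    = c("Brazil", "Brazil", "Brazil", "Taiwan", "Taiwan", "Italy"),
  COUNTRY_ID = c("BRA", "BRA", "BRA", "TWN", "TWN", "ITA")
)

# Z = number of detections per country over the whole study period
z_by_country <- mosquito %>%
  count(COUNTRY, COUNTRY_ID, name = "Z")

# Choropleth on an equirectangular projection; the hover shows country and Z
fig <- plot_geo(z_by_country) %>%
  add_trace(
    type = "choropleth",
    locations = ~COUNTRY_ID,
    z = ~Z,
    text = ~COUNTRY,
    colors = "Blues"
  ) %>%
  layout(geo = list(projection = list(type = "equirectangular")))

fig
```

Because the tooltip carries the exact Z value, countries that share almost the same colour can still be told apart on hover.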
Create the same kind of maps as in step 2 but use
Equirectangular projection with choropleth color log (Z)
Conic equal area projection with choropleth color log (Z)
Analyze the map from step 3a and make conclusions. Compare maps from 3a and 3b and comment which advantages and disadvantages you may see with both types of maps.
Compared with the plot from step 2, the Equirectangular map coloured by log(Z) makes it easier to distinguish the numbers of detections by colour.
This is mainly because applying the log function to Z makes the range of the values much smaller.
Because the height of the colour bar legend is fixed, countries with similar values are now mapped to visibly different colours.
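The effect of the log transform on the colour scale can be illustrated with a small base R sketch. The counts below are invented to mimic a skewed distribution and are not taken from the data set; rescale01 is a hypothetical helper that gives the relative position of a value on the colour bar.

```r
# Made-up counts with a strong skew (not the real data)
z <- c(5, 12, 40, 150, 900, 25000)

# Relative position of each value on a colour bar, rescaled to [0, 1]
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

linear_pos <- rescale01(z)       # colour position when mapping Z directly
log_pos    <- rescale01(log(z))  # colour position when mapping log(Z)

round(data.frame(z, linear_pos, log_pos), 3)
```

On the linear scale, five of these six values land in the lowest 5% of the colour bar, so they look almost identical. After the log transform only the smallest value remains there and the others are spread along the bar.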
The pros and cons of the Equirectangular projection and the Conic equal area projection are listed below:
| Item | Equirectangular Projection | Conic Equal Area Projection |
|---|---|---|
| Projection Type | Cylindrical | Conic |
| Area Preservation | No | Yes |
| Shape Preservation | Poor, especially near the poles | Good |
| Distortion | High, especially near the poles | Low |
In order to resolve problems detected in step 1, use data from 2013 only for Brazil and
Create variable X1 by cutting X into 100 pieces (use cut_interval() )
Create variable Y1 by cutting Y into 100 pieces (use cut_interval() )
Compute mean values of X and Y per group (X1,Y1) and the amount of observations N per group (X1,Y1)
Visualize mean X,Y and N by using MapBox
Identify regions in Brazil that are most infected by mosquitoes. Did such discretization help in analyzing the distribution of mosquitoes?
According to the plot, the eastern regions of Brazil are the most infected by mosquitoes (the yellow points on the map, for example the area near Natal). The southern regions also have a relatively high number of observations (for example the area near Sao Paulo).
Yes, discretization helps in analysing the distribution of mosquitoes, because the most infected regions can be identified easily from the colour of each tile in the grid map.
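The discretization steps listed above can be sketched as follows. This is only an illustration under stated assumptions: the points are randomly generated instead of the 2013 Brazil rows, X is assumed to be longitude and Y latitude, and the map uses plot_ly() with a "scattermapbox" trace and the "open-street-map" style so that the sketch does not depend on a Mapbox access token.

```r
library(dplyr)
library(ggplot2)  # provides cut_interval()
library(plotly)

set.seed(1)
# Synthetic points standing in for the 2013 Brazil detections
# (assumption: X = longitude, Y = latitude)
brazil <- data.frame(
  X = runif(500, -60, -35),
  Y = runif(500, -30, 0)
)

# Cut X and Y into 100 equal-width intervals each, then summarise per cell
grid <- brazil %>%
  mutate(X1 = cut_interval(X, n = 100),
         Y1 = cut_interval(Y, n = 100)) %>%
  group_by(X1, Y1) %>%
  summarise(mean_X = mean(X), mean_Y = mean(Y), N = n(), .groups = "drop")

# One marker per non-empty cell, coloured by the number of observations N
fig <- plot_ly(grid, type = "scattermapbox", mode = "markers",
               lon = ~mean_X, lat = ~mean_Y, color = ~N) %>%
  layout(mapbox = list(style = "open-street-map", zoom = 3,
                       center = list(lon = -50, lat = -15)))

fig
```

Each marker now represents a whole grid cell, so dense areas show up as a different colour instead of as a pile of overlapping dots.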
In this assignment, you will analyse the mean incomes of the Swedish households.
Download a relevant map of Swedish counties from GADM and load it into R. Read your data into R and process it in such a way that different age groups are shown in different columns. Let’s call these groups Young, Adult and Senior.
The data were downloaded from scb.se and saved to the working folder as 000006SW_20240919-115858.csv. Since the data contain Swedish characters, we set fileEncoding to ISO-8859-1 when reading the file. We then use mutate to create a new column “age_group” based on the age column and select the region, age_group and house_holds columns. Finally, we use pivot_wider to create a new data frame with one column per age group.
The first rows of the reshaped data are printed below:
## # A tibble: 6 × 4
## region Young Adult Senior
## <chr> <dbl> <dbl> <dbl>
## 1 Stockholm 454 776. 805.
## 2 Uppsala 355 639. 684.
## 3 Södermanland 375 576. 598.
## 4 Östergötland 342. 592. 628.
## 5 Jönköping 390. 611. 656.
## 6 Kronoberg 362. 593. 626.
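The processing described above might look roughly like the sketch below. The small data frame stands in for the SCB file and reuses the rounded values printed above for two counties; the age labels ("18-29 years" and so on) are assumptions about how the age column is coded in the file.

```r
library(dplyr)
library(tidyr)

# Stand-in for the SCB export. In the report the file would be read with
# read.csv("000006SW_20240919-115858.csv", fileEncoding = "ISO-8859-1").
# The age labels are assumed, the values are the rounded ones printed above.
income_long <- data.frame(
  region      = rep(c("Stockholm", "Uppsala"), each = 3),
  age         = rep(c("18-29 years", "30-49 years", "50-64 years"), times = 2),
  house_holds = c(454, 776, 805, 355, 639, 684)
)

income_wide <- income_long %>%
  # Map each age label to one of the three groups
  mutate(age_group = case_when(
    age == "18-29 years" ~ "Young",
    age == "30-49 years" ~ "Adult",
    age == "50-64 years" ~ "Senior"
  )) %>%
  select(region, age_group, house_holds) %>%
  # One row per region, one column per age group
  pivot_wider(names_from = age_group, values_from = house_holds)

income_wide
```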
Create a plot in Plotly containing three violin plots showing mean income distributions per age group. Analyze this plot and interpret your analysis in terms of income.
The violin plot is shown below. Some of the values (in thousand SEK) for the three age groups are listed in the following table.
| | Young | Adult | Senior |
|---|---|---|---|
| Min value | 332 | 545 | 565 |
| Max value | 454 | 776 | 804 |
| Mean value | 367 | 594 | 619 |
| Q1 | 350 | 563 | 577 |
| Q3 | 375 | 607 | 647 |
| Upper Fence | 390 | 639 | 731 |
The table shows that the minimum and mean incomes of the Young group are much lower than those of the other two groups. Even the maximum Young income (454K SEK) is lower than the minimum values of the other two groups (545K and 565K SEK).
Based on the interquartile range (Q1 to Q3), the middle half of the Young incomes lies between 350K SEK and 375K SEK, while the Adult and Senior incomes lie between 563K SEK and 607K SEK and between 577K SEK and 647K SEK respectively.
Values above the upper fence are outliers. The upper fence of the Adult group (639K SEK) is much lower than that of the Senior group (731K SEK), a gap of around 92K SEK, compared with a gap of only around 30K SEK between the maximum values of the two groups. So although the mean values of the Senior and Adult groups are similar, there are still many Senior incomes that exceed the Adult incomes.
The plot also shows that the variability of the income distribution decreases in the order Senior group > Adult group > Young group, which means that the income of young people is more concentrated than in the other two groups, while the income of senior people is more dispersed.
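For reference, the quartiles and the upper fence discussed above can be reproduced approximately in base R. The sketch uses only the six rounded Young values printed earlier, not the full set of counties, and it assumes that the upper fence is the largest observation not exceeding Q3 + 1.5 × IQR. Plotly may use a slightly different quantile method than R's default, so the numbers need not match the hover values exactly.

```r
# The six rounded Young values shown in the printed head of the data
young <- c(454, 355, 375, 342, 390, 362)

q1  <- unname(quantile(young, 0.25))  # R's default quantile method (type 7)
q3  <- unname(quantile(young, 0.75))
iqr <- q3 - q1

# Assumed definition: largest observation within Q3 + 1.5 * IQR
upper_fence <- max(young[young <= q3 + 1.5 * iqr])

# Observations above the fence are drawn as outliers
outliers <- young[young > upper_fence]

c(Q1 = q1, Q3 = q3, upper_fence = upper_fence)
outliers
```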
Create a surface plot in Plotly showing dependence of Senior incomes on Adult and Young incomes in various counties. What kind of trend can you see and how can this be interpreted? Do you think that linear regression would be suitable to model this dependence?
Senior income increases as Adult income increases, and the same holds for Young income versus Senior income.
Every point on the X-Y plane is one county’s Young and Adult income, so when incomes in a county are higher than in other counties, the Senior income in that county is also higher than in other counties. This agrees with common sense.
When we rotate the graph, we find that the surface is relatively flat, which means that Senior income is positively and approximately linearly related to Adult and Young income, so linear regression is a reasonable way to model this dependence. However, because the incomes of all groups are highly correlated, the model may suffer from multicollinearity.
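A quick way to check both points (the roughly linear dependence and the correlation between predictors) is sketched below in base R. It uses only the six rounded rows printed earlier, so the coefficients are illustrative and should not be read as results for all counties.

```r
# The six rounded rows shown in the printed head of the data
income <- data.frame(
  Young  = c(454, 355, 375, 342, 390, 362),
  Adult  = c(776, 639, 576, 592, 611, 593),
  Senior = c(805, 684, 598, 628, 656, 626)
)

# Linear model: a plane through the (Young, Adult, Senior) points
fit <- lm(Senior ~ Young + Adult, data = income)
coef(fit)

# Pairwise correlations; a high Young-Adult correlation makes the
# individual coefficients unstable (multicollinearity)
round(cor(income), 2)
```

If the two predictors are strongly correlated, the fitted plane may still predict Senior income well, but the separate effects of Young and Adult income are hard to interpret.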
Use plot_geo function with trace “choropleth” to visualize incomes of Young and Adults in two choropleth maps. Analyze these maps and make conclusions. Is there any new information that you could not discover in previous statistical plots?
The plots are shown below.
In the map for Young people, most regions have a similar income level, with a few exceptions.
In the map for Adults, the income level in the southern regions is higher than in the northern regions.
This may be because young people do not yet have enough work experience, so they have little influence over their income.
In both maps, the Stockholm area has the highest income level.
The information that could not be discovered in the previous statistical plots is the difference in income level between the northern and southern regions, which can be seen clearly in these maps from the colours.
Use GPS Visualizer http://www.gpsvisualizer.com/geocoder/ and extract the coordinates of Linköping. Add a red dot to the choropleth map for Young from step 4 in order to show where we are located.
The location of Linköping is (58.4098129, 15.6245251), given as (latitude, longitude). The plot of Young people’s income in Swedish counties with the location of Linköping is shown below:
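Adding the marker can be sketched as below. For brevity the example draws only the red dot on an empty plot_geo() map; in the report the same add_markers() call would be chained onto the choropleth for Young. The marker size and the "europe" scope are illustrative choices.

```r
library(plotly)

# Coordinates from the text: latitude 58.4098129, longitude 15.6245251
linkoping <- data.frame(name = "Linköping", lat = 58.4098129, lon = 15.6245251)

# On a geo map, plotly treats x as longitude and y as latitude
fig <- plot_geo() %>%
  add_markers(
    data = linkoping, x = ~lon, y = ~lat, text = ~name,
    marker = list(color = "red", size = 8), hoverinfo = "text"
  ) %>%
  layout(geo = list(scope = "europe"))

fig
```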